NAACL 2025

Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics
Albuquerque | April 29 – May 4, 2025

AI Chatbots Aren’t Experts on Psych Medication Reactions — Yet

If you think you’re having an adverse drug reaction, it’s best to call a human medical professional, at least for the time being.

Researchers at the Georgia Institute of Technology have developed a new tool to evaluate how well AI chatbots can detect potential adverse drug reactions in chat conversations, and how closely their advice aligns with that of human experts. The study was led by CS Ph.D. student Mohit Chandra and School of Interactive Computing Associate Professor Munmun De Choudhury.

By Catherine Barzler, Research Communications

Asking artificial intelligence for advice can be tempting. Powered by large language models (LLMs), AI chatbots are available 24/7, are often free to use, and draw on troves of data to answer questions. Now, people with mental health conditions are asking AI for advice when experiencing potential side effects of psychiatric medicines — a decidedly higher-risk situation than asking it to summarize a report.

One question puzzling the AI research community is how AI performs when asked about mental health emergencies. Globally, including in the U.S., there is a significant gap in mental health treatment, with many individuals having limited or no access to mental healthcare. It’s no surprise that people have started turning to AI chatbots with urgent health-related questions.

Now, researchers at the Georgia Institute of Technology have developed a new framework to evaluate how well AI chatbots can detect potential adverse drug reactions in chat conversations, and how closely their advice aligns with that of human experts. The study was led by Munmun De Choudhury, J.Z. Liang Associate Professor in the School of Interactive Computing, and Mohit Chandra, a third-year computer science Ph.D. student.

“People use AI chatbots for anything and everything,” said Chandra, the study’s first author. “When people have limited access to healthcare providers, they are increasingly likely to turn to AI agents to make sense of what’s happening to them and what they can do to address their problem. We were curious how these tools would fare, given that mental health scenarios can be very subjective and nuanced.”

De Choudhury, Chandra, and their colleagues will introduce their new framework at the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, April 29 – May 4.

Putting AI to the Test

Going into their research, De Choudhury and Chandra wanted to answer two main questions: First, can AI chatbots accurately detect whether someone is having side effects or adverse reactions to medication? Second, if they can accurately detect these scenarios, can AI agents then recommend good strategies or action plans to mitigate harm?

The researchers collaborated with a team of psychiatrists and psychiatry students to establish clinically accurate answers from a human perspective and used those to analyze AI responses.

To build their dataset, they went to the internet’s public square, Reddit, where many have gone for years to ask questions about medication and side effects.

They evaluated nine LLMs, including general-purpose models (such as GPT-4o and Llama-3.1) and specialized models trained on medical data. Using the evaluation criteria provided by the psychiatrists, they computed how precisely the LLMs detected adverse reactions and categorized the types of adverse reactions caused by psychiatric medications.

Additionally, they prompted LLMs to generate answers to queries posted on Reddit and compared the alignment of LLM answers with those provided by the clinicians over four criteria: (1) emotion and tone expressed, (2) answer readability, (3) proposed harm-reduction strategies, and (4) actionability of the proposed strategies.
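
As a concrete illustration of the kind of scoring described in the two paragraphs above, the sketch below computes detection and categorization precision against expert labels with scikit-learn and compares the readability of a clinician answer with an LLM answer using textstat. It is a minimal sketch of the general idea, not the authors' evaluation code; the labels, reaction categories, and example answers are hypothetical.

    # Illustration only: scoring in the spirit of the evaluation described above.
    # The labels, category names, and example answers are hypothetical, and the
    # paper's four alignment criteria are scored in a more involved way.
    from sklearn.metrics import precision_score
    import textstat

    # Detection: 1 = the post describes an adverse drug reaction, 0 = it does not
    expert_detection = [1, 0, 1, 1, 0, 1]
    llm_detection = [1, 0, 0, 1, 1, 1]
    print("detection precision:",
          precision_score(expert_detection, llm_detection))

    # Categorization of the reaction type (hypothetical category names)
    expert_types = ["sedation", "weight_gain", "sedation", "tremor"]
    llm_types = ["sedation", "sedation", "sedation", "tremor"]
    print("categorization precision (macro):",
          precision_score(expert_types, llm_types, average="macro", zero_division=0))

    # One alignment criterion, readability, compared between answers
    clinician_answer = "This can happen with your medication. Contact your prescriber today."
    llm_answer = "Such symptomatology may represent an idiosyncratic pharmacodynamic response."
    print("clinician readability:", textstat.flesch_reading_ease(clinician_answer))
    print("LLM readability:", textstat.flesch_reading_ease(llm_answer))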

The research team found that LLMs stumble when comprehending the nuances of an adverse drug reaction and distinguishing different types of side effects. They also discovered that while LLMs sounded like human psychiatrists in their tones and emotions — such as being helpful and polite — they had difficulty providing accurate, actionable advice that aligned with the experts’ recommendations.

Better Bots, Better Outcomes

The team’s findings could help AI developers build safer, more effective chatbots. Chandra’s ultimate goals are to inform policymakers of the importance of accurate chatbots and help researchers and developers improve LLMs by making their advice more actionable and personalized.

Chandra notes that improving AI for psychiatric and mental health concerns would be particularly life-changing for communities that lack access to mental healthcare.

“When you look at populations with little or no access to mental healthcare, these models are incredible tools for people to use in their daily lives,” Chandra said. “They are always available, they can explain complex things in your native language, and they become a great option to go to for your queries.”

“When the AI gives you incorrect information by mistake, it could have serious implications on real life,” Chandra added. “Studies like this are important, because they help reveal the shortcomings of LLMs and identify where we can improve.”

Funding: National Science Foundation (NSF), American Foundation for Suicide Prevention (AFSP), Microsoft Accelerate Foundation Models Research grant program. The findings, interpretations, and conclusions of this paper are those of the authors and do not represent the official views of NSF, AFSP, or Microsoft.

Munmun De Choudhury
J. Z. Liang Associate Professor, School of Interactive Computing
Georgia Tech


Partner Organizations

Amazon • Arizona State University • Birla Institute of Technology and Science • Brown University • Columbia University • Cornell University • Dartmouth College • Dhirubhai Ambani Institute Of Information and Communication Technology • Essential AI • Facebook • Georgia Tech • Google DeepMind • Hofstra University • Intel Labs • Korea Advanced Institute of Science & Technology • Meta • Michigan State University • Microsoft • Northwell Health • Pennsylvania State University • Purdue University • Quantexa • Stanford University • Texas A&M University – College Station • Universidad Complutense de Madrid • University of Arizona • University of California, Berkeley • University of California, Los Angeles • University of California, San Diego • University of Illinois at Urbana-Champaign • University of Massachusetts at Amherst • University of Michigan – Ann Arbor • University of Toronto • Zhejiang University




Research Finds Language Models Align Unevenly with Human Social and Moral Norms

We prompted 11 language models (GPT-4o, Gemini, Llama-3, Arctic, and others) with 400 rules of thumb (RoTs) from the Social Chemistry 101 dataset.

These RoTs represent everyday social and moral norms, such as:

  • “It’s good to work at home.”
  • “It is good to be patient.”
  • “It is ok to live with a roommate of the opposite sex if you are just friends.”

Each RoT was previously labeled by 50 U.S.-based annotators (from a pool of 100 total), spanning a range of age, gender, and income groups.

We compare model responses to these human norms using ADA-Met, a simple ordinal metric that measures how far a model’s response diverges from the modal human norm across demographic groups.
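
ADA-Met itself is defined in the paper; as a rough sketch of the underlying idea, the snippet below measures how far a model's rating sits from the most common (modal) human rating within each demographic group, on a hypothetical 1-5 agreement scale. The group names, ratings, and scale are illustrative assumptions, not the metric's actual formulation.

    # Rough illustration of an ordinal divergence from the modal human norm.
    # This is NOT the ADA-Met definition; the 1-5 scale, group names, and
    # ratings below are hypothetical.
    from statistics import mean, mode

    # Human ratings of one rule of thumb, on a 1-5 agreement scale, by group
    human_ratings = {
        "age_18_29": [4, 4, 5, 3, 4],
        "age_30_49": [3, 3, 4, 3, 2],
        "age_50_plus": [2, 3, 2, 2, 3],
    }
    model_rating = 4  # the model's rating of the same rule of thumb

    # Absolute distance from each group's modal rating, averaged across groups
    per_group = {g: abs(model_rating - mode(r)) for g, r in human_ratings.items()}
    divergence = mean(per_group.values())

    print(per_group)   # {'age_18_29': 0, 'age_30_49': 1, 'age_50_plus': 2}
    print(divergence)  # lower values mean closer alignment with that population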

Recognizing that LLMs are increasingly used for subjective judgments, we emphasize the importance of knowing whose opinions these models reflect. Social and moral norms, which vary across cultures and societies, are central to these judgments. Our study revealed LLMs don’t capture a broad range of human perspectives, risk reinforcing stereotypes, and can contribute to unequal treatment.

Key findings:

  • Most models align more closely with younger, higher-income, unmarried individuals
  • Some models refused to answer sensitive RoTs, limiting normative coverage
  • Prompt structure (e.g., using markdown tables) improves alignment

Huge thanks to my co-authors Agam Shah, Dipanwita Guhathakurta, Poojitha Nandigam, Sudheer Chava, and the Georgia Tech Financial Services Innovation Lab.

Michael Galarnyk
ML Ph.D. student at Georgia Tech

Sudheer Chava
Alton M. Costley Chair Professor • Finance


Study Reveals Why Language Models Struggle with Arab Culture

Our study investigated why language models (LMs) often show a bias towards Western culture when working with Arabic, a non-Western language. We explored several factors, including the data that LMs are trained on and the linguistic differences between languages. To aid in this, we created a new dataset called CAMeL-2, which contains over 58,000 entities (names, places, etc.) from both Arab and Western cultures with examples in both Arabic and English.

CAMeL-2 was used to test various LMs on tasks like answering questions and identifying entities in both languages. By comparing the models’ performance in English versus Arabic, we aimed to pinpoint the sources of the cultural bias.

Key findings:

  • LMs performed better at understanding Arab cultural information when tested in English compared to Arabic.
  • LMs struggled in Arabic with high-frequency Arab entities whose names have multiple meanings (polysemy); for example, an entity’s name might also be a common word for a food. This issue was less common with Western entities transliterated into Arabic.
  • When Arab entities had similar spellings to common words in other languages that use the Arabic script (like Farsi or Urdu), LMs also had more difficulty recognizing them in Arabic.
  • The way words are broken down into smaller units (tokens) affected performance. LMs struggled with Arab entities that were tokenized into a single unit, especially if that unit was a polysemous word in Arabic. This problem worsened for models with larger Arabic vocabularies.

The study suggests that cultural bias in LMs isn’t just about the amount of Western data. The unique features of the Arabic language, like words having multiple meanings and similarities to other script-sharing languages, along with how these languages are processed by LMs, also play a significant role. We believe this understanding is crucial for building more equitable multilingual LMs.
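
The tokenization effect is straightforward to inspect directly. The sketch below uses a Hugging Face tokenizer to check whether an entity name surfaces as a single token; the checkpoint and example entities are placeholders chosen for illustration, not the tokenizers or CAMeL-2 entities analyzed in the study.

    # Sketch: inspecting how a multilingual tokenizer splits entity names.
    # The checkpoint and example strings are placeholders, not the tokenizers
    # or CAMeL-2 entities analyzed in the study.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

    for entity in ["فيروز", "Fairuz", "شيكاغو", "Chicago"]:
        tokens = tokenizer.tokenize(entity)
        print(f"{entity!r}: {tokens} (single token: {len(tokens) == 1})")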

Tarek Naous
ML Ph.D. student at Georgia Tech

Wei Xu
Associate Professor • Interactive Computing


UNLEARN Forgets Knowledge to Preserve Data Privacy, Shows Big Gains

This research tackles the growing need to efficiently remove specific information from large language models (LLMs) without having to retrain the entire model. The paper introduces a new technique called UNLEARN that can selectively forget knowledge, which is increasingly important due to data privacy regulations like the ‘Right to be Forgotten’ laws. Traditional methods for removing knowledge are often inefficient and can negatively impact other knowledge the model has learned.

The UNLEARN method works by first identifying the specific area (subspace) within the LLM’s internal workings that is responsible for the knowledge to be forgotten. It then uses a process called subspace discrimination to separate this targeted knowledge from similar knowledge, ensuring that removing one doesn’t harm the other. UNLEARN can then remove the identified subspace, effectively making the model forget the targeted information.
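
The linear-algebra intuition behind removing a subspace can be shown in a few lines. The toy NumPy sketch below composes a weight matrix with a projection onto the orthogonal complement of a low-rank "forget" subspace, so inputs lying in that subspace no longer influence the output; it is a simplification for intuition only and leaves out UNLEARN's subspace identification and discrimination steps.

    # Toy illustration of removing a subspace from a weight matrix with NumPy.
    # This is a simplification for intuition, not the UNLEARN algorithm:
    # identifying the subspace and discriminating it from similar knowledge
    # (the hard parts) are omitted.
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, 8))                   # a layer's weight matrix (toy size)
    V = np.linalg.qr(rng.normal(size=(8, 2)))[0]  # orthonormal basis of the rank-2
                                                  # subspace to be forgotten

    # Compose W with the projector onto the orthogonal complement of span(V),
    # so any input component in span(V) is zeroed out before W acts on it.
    P = np.eye(8) - V @ V.T
    W_forgotten = W @ P

    # An input lying entirely in the forgotten subspace now has no effect.
    x_in_subspace = V @ rng.normal(size=2)
    print(np.allclose(W_forgotten @ x_in_subspace, 0.0))  # True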

Key Findings:

  • UNLEARN achieved a high forgetting rate of 96% on targeted knowledge while maintaining performance on dissimilar tasks within 2.5% of the original model’s performance. This demonstrates a significant improvement over previous methods in selectively removing information.
  • When dealing with similar tasks, UNLEARN achieved nearly 80% forgetting on the targeted task while preserving performance on similar tasks within 10%. This highlights UNLEARN’s ability to discriminate between closely related knowledge, a challenge for existing unlearning techniques.
  • The study also introduced LEARN, a dual method to UNLEARN, which can add new knowledge to an LLM and match the fine-tuning accuracy of LoRA without negatively affecting other tasks. This showcases the versatility of the underlying approach for both knowledge removal and addition.

UNLEARN represents a significant advancement in the ability to efficiently and precisely remove knowledge from LLMs without requiring access to the training data and without causing unwanted side effects on other learned information. This has important implications for data privacy, security, and the efficient adaptation of large language models.

Tyler Lizzo
ECE Ph.D. student at Georgia Tech

Larry Heck
Professor • Electrical and Computer Engineering & Interactive Computing


VL‑Time Method Enables Language Models to Reason about Time Series Data

Our research team explored the capability of large language models (LLMs) to reason about time-series data, which are sequences of measurements recorded over time and are common in many real-world applications. We found that LLMs often struggle with this type of reasoning when the data is presented as simple numbers. To better understand these limitations, we created TimerBed, a new and comprehensive testbed for evaluating how well LLMs can handle time-series reasoning. TimerBed includes different types of reasoning tasks based on real-world data, uses various advanced LLMs and reasoning strategies, and provides benchmarks for comparison.

Our work revealed that LLMs generally perform poorly in time‑series reasoning when directly given numerical data, often performing no better than random guessing. This failure might be because it’s difficult for LLMs to extract important features and handle the long sequences of numbers typically found in time‑series data. To address this issue, we developed a new prompt‑based method called VL‑Time. Instead of feeding numbers directly, VL‑Time uses visualizations (like graphs) of the time‑series data, combined with language‑based instructions to guide the LLMs’ reasoning process.
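
As a rough sketch of this plot-then-prompt idea (not the actual VL-Time pipeline), the code below renders a toy series with matplotlib, encodes the figure as an image, and sends it with a short task instruction to a multimodal model through the OpenAI Python SDK. The model name, instruction wording, and data are assumptions made for illustration.

    # Rough sketch of the plot-then-prompt idea: render the series as an image
    # and ask a multimodal LLM to reason over the picture instead of raw numbers.
    # The model name, instruction text, and data are illustrative assumptions,
    # not the prompts or tasks used in the paper.
    import base64
    import io

    import matplotlib.pyplot as plt
    from openai import OpenAI

    series = [3, 4, 6, 9, 14, 22, 35, 30, 21, 13, 8, 5]  # toy time series

    fig, ax = plt.subplots(figsize=(6, 3))
    ax.plot(series, marker="o")
    ax.set_xlabel("time step")
    ax.set_ylabel("value")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    image_b64 = base64.b64encode(buf.getvalue()).decode()

    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Does this series show a rising trend, a falling trend, "
                         "or a peak followed by a decline? Answer in one sentence."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)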

Key findings:

  • VL-Time significantly improves the ability of multimodal LLMs to reason about time series. It achieved an average performance improvement of 140% and up to 433% compared to using numerical data directly.
  • It enables multimodal LLMs to perform non-trivial zero-shot reasoning on time series, meaning they can reason about new tasks without prior examples. This is a notable improvement from the near-random performance observed when using numerical data in a zero-shot setting.
  • VL-Time makes LLMs powerful few-shot reasoners for time series. With just a few examples, VL-Time allowed LLMs to outperform all tested supervised time-series models on tasks involving simple and complex deterministic reasoning.

Haoxin Liu
CS Ph.D. student at Georgia Tech

B. Aditya Prakash
Associate Professor • Computational Science and Engineering


Main

Computational Social Science and Cultural Analytics

Communication Makes Perfect: Persuasion Dataset Construction via Multi-LLM Communication
Weicheng Ma, Hefan Zhang, Ivory Yang, Shiyu Ji, Joice Chen, Farnoosh Hashemi, Shubham Mohole, Ethan Gearey, Michael Macy, Saeed Hassanpour, Soroush Vosoughi


Generation

Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage
Kaige Xie, Philippe Laban, Prafulla Kumar Choubey, Caiming Xiong, Chien-Sheng Wu


Human-Centered NLP

Lived Experience Not Found: LLMs Struggle to Align with Experts on Addressing Adverse Drug Reactions from Psychiatric Medication Use
Mohit Chandra, Siddharth Sriraman, Gaurav Verma, Harneet Singh Khanuja, Jose Suarez Campayo, Zihang Li, Michael L. Birnbaum, Munmun De Choudhury

Sociodemographic Prompting is Not Yet an Effective Approach for Simulating Subjective Judgments with LLMs
Huaman Sun, Jiaxin Pei, Minje Choi, David Jurgens


Information Extraction

GLiREL – Generalist Model for Zero-Shot Relation Extraction
Jack Boylan, Chris Hokamp, Demian Gholipour Ghalandari


Language Modeling

Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training
Yuchen Zhuang, Jingfeng Yang, Haoming Jiang, Xin Liu, Kewei Cheng, Sanket Lokegaonkar, Yifan Gao, Qing Ping, Tianyi Liu, Binxuan Huang, Zheng Li, Zhengyang Wang, Pei Chen, Ruijie Wang, Rongzhi Zhang, Nasser Zalmout, Priyanka Nigam, Bing Yin, Chao Zhang

Self-Generated Critiques Boost Reward Modeling for Language Models
Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou


Low-resource Methods for NLP

Babysit A Language Model From Scratch: Interactive Language Learning by Trials and Demonstrations
Ziqiao Ma, Zekun Wang, Joyce Chai


NLP Applications

A Picture is Worth A Thousand Numbers: Enabling LLMs Reason about Time Series via Visualization
Haoxin Liu, Chenghao Liu, B. Aditya Prakash

Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering
Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, Diyi Yang

Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping
Ryan Li, Yanzhe Zhang, Diyi Yang


Phonology, Morphology, and Word Segmentation

The Impact of Visual Information in Chinese Characters: Evaluating Large Models’ Ability to Recognize and Utilize Radicals
Xiaofeng Wu, Karl Stratos, Wei Xu


Resources and Evaluation

CausalEval: Towards Better Causal Reasoning in Language Models
Longxuan Yu, Delin Chen, Siheng Xiong, Qingyang Wu, Dawei Li, Zhikai Chen, Xiaoze Liu, Liangming Pan

Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages
Max Zuo, Francisco Piedrahita Velez, Xiaochen Li, Michael Littman, Stephen Bach


Special Theme

A Survey of NLP Progress in Sino-Tibetan Low-Resource Languages
Shuheng Liu, Michael Best

Is It Navajo? Accurate Language Detection for Endangered Athabaskan Languages
Ivory Yang, Weicheng Ma, Chunhui Zhang, Soroush Vosoughi

On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena
Tarek Naous, Wei Xu


Findings

Dialogue and Interactive Systems

Adapting LLM Agents with Universal Communication Feedback
Kuan Wang, Yadong Lu, Michael Santacroce, Yeyun Gong, Chao Zhang, yelong shen


Ethics, Bias, and Fairness

LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression
Souvik Kundu, Anahita Bhiwandiwalla, Sungduk Yu, Phillip Howard, Tiep Le, Sharath Nittur Sridhar, David Cobbley, Hao Kang, Vasudev Lal


Information Extraction

BioEL: A Comprehensive Python Package for Biomedical Entity Linking
Prasanth Bathala, Christophe Ye, Batuhan Nursal, Shubham Lohiya, David Kartchner, Cassie S. Mitchell


Low-resource Methods for NLP

UNLEARN Efficient Removal of Knowledge in Large Language Models
Tyler Lizzo, Larry Heck


NLP Applications

Do Large Language Models Align with Core Mental Health Counseling Competencies?
Viet Cuong Nguyen, Mohammad Taher, Dongwan Hong, Vinicius Konkolics Possobom, Vibha Thirunellayi Gopalakrishnan, Ekta Raj, Zihang Li, Heather J. Soled, Michael L. Birnbaum, Srijan Kumar, Munmun De Choudhury

From Intentions to Techniques: A Comprehensive Taxonomy and Challenges in Text Watermarking for Large Language Models
Harsh Nishant Lalai, Aashish Anantha Ramakrishnan, Raj Sanjay Shah, Dongwon Lee


Special Theme

How Inclusively do LMs Perceive Social and Moral Norms?
Michael Galarnyk, Agam Shah, Dipanwita Guhathakurta, Poojitha Nandigam, Sudheer Chava

See you in Albuquerque!


Development: College of Computing
Project and Web Lead/Data Graphics: Joshua Preston
Featured News: Catherine Barzler
Photography: Kevin Beasley and Terence Rushin; submitted photos
Data: https://2025.naacl.org/program/accepted_papers/